Populations & Samples

Matthew Talluto

13.02.2022

Describing populations and samples

Aside: graphics in R

R currently has two dominant graphical engines

Base graphics

ggplot2

We will mostly use base graphics in the course

If you want to learn ggplot, ask Matt

Histograms: Base graphics

# generate some random numbers from the lognormal distribution
my_var = rlnorm(1000, log(10), 0.5)
hist(my_var)

Histograms: Base graphics

The defaults are not very nice, so let's improve things:

hist(my_var, main = "", xlab="Variable of interest", ylab = "Frequency", 
     breaks = 20, col="#999999", border=NA)

Colors

head(colors())
## [1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
## [5] "antiquewhite2" "antiquewhite3"

Describing populations and samples

Summary statistics: location

The population mean (\(\mu\)) can be approximated with the sample mean:

\[ \mu \approx \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \]

mean(my_var)
## [1] 11.6
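The sample mean formula can be computed by hand with sum() and length(); a minimal check that it matches mean():

```r
# sample mean computed directly from the formula
x = c(2, 4, 6, 8)
sum(x) / length(x)  # 5
mean(x)             # 5, identical
```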

Summary statistics: location

Other location statistics

The mean can be strongly influenced by outliers.

median(my_var)
## [1] 10.2

# mode() does NOT give the statistical mode; it returns R's storage mode
mode(my_var) # wrong!
## [1] "numeric"

# we can use the histogram function to approximate the sample mode
# changing the number of breaks will have a large impact on the results
my_hist = hist(my_var, breaks = 30, plot = FALSE)

# the mids variable gives you the midpoint of each bin
# counts gives you the count in each bin
# cbind shows them together in columns
head(cbind(my_hist$mids, my_hist$counts))
##      [,1] [,2]
## [1,]    3   35
## [2,]    5  107
## [3,]    7  161
## [4,]    9  175
## [5,]   11  135
## [6,]   13  130

# use this to get the mode
# first find out which one is the tallest with which.max
(mode_position = which.max(my_hist$counts))
## [1] 4
my_hist$mids[mode_position]
## [1] 9
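An alternative is to estimate the mode from a kernel density estimate: the sample mode is then the x-value where the estimated density peaks. A sketch, re-generating my_var with a seed for reproducibility (the bandwidth, like the number of breaks, affects the answer):

```r
# approximate the sample mode from a kernel density estimate
set.seed(1)
my_var = rlnorm(1000, log(10), 0.5)
dens = density(my_var)
# x-value where the estimated density is highest
dens$x[which.max(dens$y)]
```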

Comparing variables

We can compare variables in a location-independent way by centering (subtracting the mean)

c(min(my_var), max(my_var))
## [1]  2.49 52.88
range(my_var)
## [1]  2.49 52.88

quantile(my_var, 0.4)
##  40% 
## 9.07

# Centre the variable
# note! arithmetic operators like `-` are vectorized!
my_var_ctr = my_var - mean(my_var)
mean(my_var_ctr)
## [1] 5.31e-17
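Base R's scale() does the same centering in one step, and can additionally divide by the standard deviation to produce z-scores; a sketch:

```r
x = c(1, 5, 9)
x_ctr = scale(x, center = TRUE, scale = FALSE)  # centre only
mean(x_ctr)        # 0
z = scale(x)       # centre AND divide by sd: z-scores
c(mean(z), sd(z))  # 0 and 1
```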

Summary statistics: dispersion (or scale)

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i-\mu)^2 \]

We can estimate \(\sigma^2\) using the sample variance:

\[ \sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]

It is convenient to talk about the scale of \(x\) in the same units as \(x\) itself, so we use the (population or sample) standard deviation:

\[ \sigma = \sqrt{\sigma^2} \approx s = \sqrt{s^2} \]

# Population variance -- biased if x is a sample!
sum((my_var - mean(my_var))^2)/length(my_var)
## [1] 37.8

# These R functions always produce the sample variance and sd
var(my_var)
## [1] 37.9
sd(my_var)
## [1] 6.15
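The population and sample variances differ only by a factor of (n − 1)/n, which is why the two numbers above are so close for n = 1000; a quick check of the relationship:

```r
set.seed(1)
x = rlnorm(1000, log(10), 0.5)
n = length(x)
pop_var = sum((x - mean(x))^2) / n  # divides by n: biased for samples
samp_var = var(x)                   # divides by n - 1
all.equal(pop_var, samp_var * (n - 1) / n)  # TRUE
```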

Other dispersion statistics

# the range function gives you the min and max
# Take the difference to get the statistical range
diff(range(my_var))
## [1] 50.4
max(my_var) - min(my_var)
## [1] 50.4

IQR(my_var)
## [1] 6.69
quantile(my_var, 0.75) - quantile(my_var, 0.25)
##  75% 
## 6.69

# coefficient of variation can be done manually
sd(my_var)/mean(my_var)
## [1] 0.531

boxplot(my_var)

Summary statistics: higher moments

Asymmetry: skewness

Is the distribution weighted to one side or the other?

\[ \mathrm{skewness} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^3}{(n-1)s^3} \]

Tailedness: kurtosis

How fat are the tails relative to a normal distribution?

\[ \mathrm{kurtosis} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{(n-1)s^4} \]
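Base R has no built-in skewness or kurtosis functions (the moments and e1071 packages provide them), but both formulas above are one-liners; a sketch for a lognormal sample, which should be right-skewed:

```r
set.seed(42)
x = rlnorm(1000, log(10), 0.5)
n = length(x); xbar = mean(x); s = sd(x)
(skew = sum((x - xbar)^3) / ((n - 1) * s^3))  # > 0: right-skewed
(kurt = sum((x - xbar)^4) / ((n - 1) * s^4))
```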

More on boxplots

boxplot(body_mass_g ~ species, data = penguins, boxwex=0.4, notch = TRUE)

Repeating yourself

# bad!
(col_means = c(
  mean(penguins[,3]),
  mean(penguins[,4]),
  mean(penguins[,5]),
  mean(penguins[,3])  # it's easy to introduce mistakes this way!
))
## [1]  44.0  17.2 201.0  44.0

Applying functions

# better
# use columns 3:6 because these are the numeric columns
# doesn't make sense to take the mean of penguins$species
(col_means = sapply(penguins[,3:6], mean)) # apply the function mean to each variable
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##              44.0              17.2             201.0            4207.1
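One caveat: with missing values (the full palmerpenguins dataset contains a few), mean() returns NA. Extra arguments to sapply() are passed on to the applied function, so na.rm = TRUE fixes this; a sketch using airquality, a built-in data.frame that contains NAs:

```r
# airquality ships with R; Ozone and Solar.R contain NAs
sapply(airquality[, 1:4], mean)                # NA where values are missing
sapply(airquality[, 1:4], mean, na.rm = TRUE)  # na.rm is passed to mean()
```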

Applying functions

sapply(penguins[,3:6], range)
##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## [1,]           32.1          13.1               172        2700
## [2,]           59.6          21.5               231        6300
# here, we pass an additional argument named probs to quantile
# see ?quantile for what this does
sapply(penguins[,3:6], quantile, probs = c(0.25, 0.5, 0.75))
##     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 25%           39.5          15.6               190        3550
## 50%           44.5          17.3               197        4050
## 75%           48.6          18.7               213        4775

Tabular applies

# with: use the variables found in the data.frame penguins
# allows us to not write penguins$species to access the variable species
with(penguins, 
     # compute the mean of bill_length_mm for each of the three species
     tapply(bill_length_mm, species, mean)
)
##    Adelie Chinstrap    Gentoo 
##      38.8      48.8      47.6
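aggregate() gives the same per-group summaries with a formula interface, returning a data.frame rather than a named vector; a sketch using the built-in iris data (penguins requires the palmerpenguins package):

```r
# one row per group, one column per summarised variable
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
```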

Probability distributions

Thought experiment

You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0 m, the western boundary as -50 m, and the eastern boundary as +50 m.

Each person flips a fair coin. If the coin is heads, they take a step east (add 0.5 m to their location); if it's tails, they take a step west (subtract 0.5 m from their location).

Question: What is the long-run distribution of positions on the field?

[Simulated positions shown at t = 10, t = 50, and t = 500]

Probability distributions

We can simulate this experiment in R. Here is code for doing it for one person:

position = 0
for(i in 1:100) {
  coin_flip = sample(c("heads", "tails"), 1)
  if(coin_flip == "heads") {
    position = position + 0.5
  } else {
    position = position - 0.5
  }
}
position
## [1] 1
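Because the flips are independent, the loop can be replaced by sampling all 100 steps at once; cumsum() then gives the position after every flip, not just the last one. A sketch:

```r
set.seed(1)
steps = sample(c(0.5, -0.5), size = 100, replace = TRUE)
trajectory = cumsum(steps)  # position after each of the 100 flips
sum(steps)                  # final position, same as trajectory[100]
```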

Probability distributions

And for a larger sample of friends (2500), each doing more coin flips (500), with more concise code:

# repeat the initial position 2500 times, one per friend
n_friends = 2500
positions = rep(0, n_friends)

# repeat the experiment 500 times
time_steps = 1:500
for(i in time_steps) {
    # flip a coin for each friend
    # returns one heads or tails for each of the n_friends friends
    coin_flips = sample(c("heads", "tails"), size = n_friends, replace = TRUE)
    # compute the steps: +0.5 if the flip is heads, -0.5 if tails
    moves = ifelse(coin_flips == "heads", 0.5, -0.5)
    # finally add the moves to the old positions and save the new position
    positions = positions + moves
}
hist(positions, breaks=40, col="gray", main="")
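Each ±0.5 step has variance 0.25, and variances of independent steps add, so after 500 steps the theoretical standard deviation of a position is sqrt(500 × 0.25) ≈ 11.2; the simulated positions should agree. A sketch:

```r
set.seed(1)
n_friends = 2500
positions = rep(0, n_friends)
for (i in 1:500) {
  moves = sample(c(0.5, -0.5), n_friends, replace = TRUE)
  positions = positions + moves
}
sqrt(500 * 0.25)  # theoretical sd, about 11.18
sd(positions)     # simulated sd, close to the theoretical value
```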

Probability distributions: PDFs

hist(positions, breaks=40, col="gray", main = "", freq=FALSE)
lines(density(positions, adjust=1.5), col='red', lwd=2)

Normal distributions

hist(positions, breaks=40, col="gray", main = "", freq=FALSE)
lines(density(positions, adjust=1.5), col='red', lwd=2)
mu = mean(positions)
sig = sd(positions)
x_norm = seq(min(positions), max(positions), length.out = 400)
y_norm = dnorm(x_norm, mu, sig)
lines(x_norm, y_norm, lwd=2, col='blue')
legend("topright", legend=c(paste("sample mean =", round(mu, 2)), 
                            paste("sample sd =", round(sig, 2))), lwd=0, bty='n')

Normal distributions

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} \]
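The density formula can be typed in directly and compared with dnorm(); a sketch:

```r
# normal PDF computed from the formula
f = function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}
f(3, 0, 1)     # 0.00443...
dnorm(3, 0, 1) # same value
```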

Normal distributions: area under the curve

Normal distributions: CDF

\[ g(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left (\frac{t-\mu}{\sigma} \right )^2} \, dt \]
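pnorm() evaluates this integral; integrate() can confirm it numerically by computing the area under dnorm() up to q. A sketch:

```r
q = -1.96
integrate(dnorm, lower = -Inf, upper = q)$value  # numerical area
pnorm(q)                                         # about 0.025
```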

Distributions in R

PDF: what is the probability density when \(x=3\) (the height of the bell curve)

dnorm(x = 3, mean = 0, sd = 1)
## [1] 0.00443

CDF: what is the cumulative probability when \(x=q\)

(area under the bell curve from \(-\infty\) to \(q\))

(probability of observing a value < \(q\))

pnorm(q = -1.96, mean = 0, sd = 1)
## [1] 0.025

Quantiles: what is the value of \(x\), such that the probability of observing x or smaller is \(p\)

(inverse of the CDF)

qnorm(p = 0.025, mean = 0, sd = 1)
## [1] -1.96

RNG: Random number generator, produces \(n\) random numbers from the desired distribution

rnorm(n = 10, mean= 0, sd = 1)
##  [1]  0.606 -1.230 -1.041  0.362  1.731 -0.855 -0.579  1.140  0.852  0.711

R supports many distributions; we will discuss others as they come up.

Sampling error

Standard error of the mean

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} \]
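The formula can be demonstrated by simulation: draw many samples of size n, take the mean of each, and the standard deviation of those sample means approximates σ/√n. A sketch:

```r
set.seed(1)
n = 25
# 10000 sample means, each from a sample of size n with sigma = 2
sample_means = replicate(10000, mean(rnorm(n, mean = 0, sd = 2)))
sd(sample_means)  # simulated standard error
2 / sqrt(n)       # theoretical sigma / sqrt(n) = 0.4
```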

Central limit theorem